-
Notifications
You must be signed in to change notification settings - Fork 17
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
strobealign-aemb module for metagenomic binning #394
Conversation
Thanks Shaojun. I think Marcel is a bit busy right now but I am sure he will have a look when there is sufficient time. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi, here’s a partial review. Some additional comments:
- Tests are needed
- Documentation needs to be added (including a description of the file format).
- Please ensure that
--aemb
and-x
cannot be used at the same time.
src/aln.cpp
Outdated
std::vector<float> &abundance, | ||
int read1_len, | ||
int read2_len, | ||
bool output_abundance |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of modifying this function, I believe it would be better to just add a new function, maybe named get_abundances
or so. You can copy the get_best_map_location
function and adjust it as necessary. That will add some duplication, but that can be fixed later.
Please make it so that the function returns the abundances.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please make it so that the function returns the abundances.
I take that back; that would be less efficient. It is ok to have this as a mutable in/out argument.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here are some more comments. Also, this needs documentation and tests.
src/aln.cpp
Outdated
std::vector<float> &abundance, | ||
int read1_len, | ||
int read2_len, | ||
bool output_abundance |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please make it so that the function returns the abundances.
I take that back; that would be less efficient. It is ok to have this as a mutable in/out argument.
This way, the strobealign sources are more self-contained and it should not be necessary to have internet access at build time.
See https://github.com/alugowski/poolSTL Sorting the randstrobes is currently a bottleneck in index generation as it does not run in parallel. This is an attempt to parallelize it. poolstl’s sort uses regular std::sort under the hood. We currently use pdqsort_branchless, which is about twice as fast as std::sort, so parallel sorting breaks even with pdqsort_branchless at about 2-3 threads. It gets faster with more threads, but not as much as one would perhaps expect. Here are the sorting runtimes for CHM13: - 31 s with pdqsort_branchless - 59 s with std::sort - 34 s with parallel sort, 2 threads - 24 s with parallel sort, 4 threads - 23 s with parallel sort, 8 threads Another issue is that sorting is no longer in place, so memory usage goes up by a couple of gigabytes, which is another reason for me not to make this change.
This version indeed gives us a very nice speed improvement without using extra memory: Sorting-only runtimes: * 1 thread: 32.8 s * 2 threads: 20.3 s * 4 threads: 15.7 s * 8 threads: 14.8 s Overall indexing runtimes (before/after): * 1 thread: 151 s → 153 s * 2 threads: 100 s → 88 s * 4 threads: 73 s → 57 s * 8 threads: 63 s → 47 s
... by sorting them also by position. This way, it does not matter in which way the randstrobes vector is partitioned during sort. Otherwise, the order would depend on the number of threads used to create the index, making mapping results not reproducible across runs that do not use the same no. of threads. Note: There is still a tiny chance for collisions/nondeterminism because we ignore RefRandstrobe::m_packed. For efficiency, this uses a branchless comparison function inspired by @alugowski’s comment in PR ksahlin#386. Runtimes for sorting with one thread 32 s - only by hash 45 s - by hash and position using std::tie 42 s - by hash and position using branchless_compare from the PR 35 s - by hash and position using __uint128_t (this commit) Runtimes for sorting (four cores with hyperthreading) threads | sorting time | index creation time -|-|- 1 | 35 s | 154 s 2 | 25 s | 95 s 4 | 16 s | 60 s 8 | 15 s | 48 s
Right now, strobealign only supports up to 2²⁴ sequences. If the user tries more, it would silently accept it, but later crash. This was triggered when trying to map to the Greengenes database https://ftp.microbio.me/greengenes_release/2022.10/ https://ftp.microbio.me/greengenes_release/2022.10/2022.10.seqs.fna.gz Even a single read like the one below trigger a crash ``` @M05314:127:000000000-BWLLJ:1:1101:15267:1654 2:N:0:1 CCTGTTCGCTCCCCACGCTTTCGTCCCTCAGCGTCAATATTGTGCCAGAATGCTGCCTTCGCCATTGGTGTTCCTCCTGATATCTACGCATGTCACCGCTACACCAGGAATTCCACATTCCTCTCACATATTCTATTTTATCAGTTTTGAT + AAA1AF@1>AAAGG1A0EAFGGEHAAEGFCG1AAEE/F2FG2F2FF1CA0FBDED1BGFGFFE?AF1BFFCFHDGFFHB1FFGFGEEFE/?/BF2F@/EGEEB00/0//0BFG1>B1BGFEFHHGGFFD12BGH2FDFFFGG22GDD>@/F ```
This is 3-state variable: SAM/PAF/Abundance
RFCT aemb
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just two more tiny comments. Also, I will probably rearrange the documentation a little bit (clarifying the different output modes a bit) after this has been merged, but it’s probably easiest if it’s not part of this PR. I’ll also add a changelog entry later.
src/aln.cpp
Outdated
std::make_pair(std::cref(nams2), read2_len) }) { | ||
size_t best_score = 0; | ||
// We loop twice because we need to count the number of NAMs with best score | ||
for (auto &t : nams) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you please rename t
to nam
?
(Also below in align_or_map_single
)
src/main.cpp
Outdated
@@ -105,6 +105,12 @@ InputBuffer get_input_buffer(const CommandLineOptions& opt) { | |||
} | |||
} | |||
|
|||
void output_abundance(std::vector<double> abundances, References references){ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Both parameters should be passed by const reference
Co-authored-by: Marcel Martin <[email protected]>
This looks good now! I think this deserves a slightly longer section in the documentation, but I’ll do that after merging. |
Hi,
This is the aemb (abundance estimation for metagenomic binning) module, the usage is :
strobealign ref.fa read1.fq read2.fq --aemb > abun.txt
In my benchmarking, aemb can output similar results, but decrease more than 90% running time.
Sincerely
Shaojun